KIPEs3: Automatic annotation of biosynthesis pathways

Flavonoids and carotenoids are pigments involved in stress mitigation and numerous other processes. Both pigment classes can contribute to flower and fruit coloration. Flavonoid aglycones and carotenoids are produced by a pathway that is largely conserved across land plants. Glycosylations, acylations, and methylations of the flavonoid aglycones can be species-specific and lead to a plethora of biochemically diverse flavonoids. We previously developed KIPEs for the automatic annotation of biosynthesis pathways and presented an application on the flavonoid aglycone biosynthesis. KIPEs3 is an improved version with additional features and the potential to identify not just the core biosynthesis players, but also candidates involved in the decoration steps and in the transport of flavonoids. Functionality of KIPEs3 is demonstrated through the analysis of the flavonoid biosynthesis in Arabidopsis thaliana Nd-1, Capsella grandiflora, and Dioscorea dumetorum. We demonstrate the applicability of KIPEs to other pathways by adding the carotenoid biosynthesis to the repertoire. As a technical proof of concept, the carotenoid biosynthesis was analyzed in the same species and Daucus carota. KIPEs3 is available as an online service to enable access without prior bioinformatics experience. KIPEs3 facilitates the automatic annotation and analysis of biosynthesis pathways with a consistent and high quality in a large number of plant species. Numerous genome sequencing projects are generating a huge amount of data sets that can be analyzed to identify evolutionary patterns and promising candidate genes for biotechnological and breeding applications.


Background
Carotenoids and flavonoids are two important classes of plant pigments.Colorful flowers are often pigmented by one or both of these pigments.Flavonoids (anthocyanins) are known for blue, purple, red, and orange colors, while carotenoids can provide yellow, orange, and red pigmentation.Biological roles of carotenoids include photoprotection [1,2], light capture [3], coloration of plant structures [4][5][6], and hormone precursors [7].Therefore, carotenoids are considered part of the primary metabolism.In contrast, flavonoids are specialized plant metabolites that contribute to stress resilience and play important roles in the reproduction of many plants [8][9][10].The biosynthesis of flavonoid aglycones [11,12], the decoration with sugar and acyl groups [13][14][15], and the intracellular transport of flavonoids [16][17][18] have been reviewed elsewhere.The core of the flavonoid biosynthesis pathway is well conserved across land plant species, whereas the decorating reactions are often species-specific.Glycosylation and acylation modify the properties of flavonoids, but specific properties depend on the added sugar moiety or acyl group.Glycosylation can enhance the anthocyanin stability [19][20][21] and be required for intracellular flavonoid transport [17,18].Acylation can also boost the stability of anthocyanins [22].Applications of anthocyanins as food colorants rely on a high stability.
Biotechnological applications involving flavonoids are generally facilitated by knowledge about the genes involved in their biosynthesis.For example, such insights are crucial for the production of flavonoids in heterologous systems [23].The flavonoid biosynthesis pathway is well studied in several model organisms and crops like Arabidopsis thaliana [18,20], Petunia hybrida [11,24,25], Antirrhinum majus [26,27], Vitis vinifera [28][29][30], and Zea mays [20,31].An ongoing challenge is the efficient transfer of this knowledge to other plant species.In fact, the functional annotation transfer challenge goes far beyond knowledge about the flavonoid biosynthesis.Plant genome sequences are becoming available with an increasing pace [32,33], but the functional annotation is lacking behind.Automatic approaches for biological interpretation of these data sets are crucial to exploit the full value of genomic resources.There are several comprehensive databases collecting and sharing knowledge about plant metabolism, e.g.The Plant Metabolic Network [34], Plant Reactome [35], and MetaCyc [36].Tools like Inter-ProScan5 [37], Blast2GO [38], KEGG Automatic Annotation Server [39], and Mercator4 [40] allow the assignment of certain annotation terms to input sequences.While these tools are able to annotate genome-scale data sets, they are less specific with respect to the assigned annotation.Researchers looking for a specific pathway could benefit from tools focusing on their pathway of interest.
Here, we present an updated approach for the automatic annotation of the flavonoid biosynthesis genes in any land plant species of interest based on prior knowledge about the pathway.We also demonstrate that this approach can be applied to other pathways by investigating the carotenoid biosynthesis genes in multiple land plant species.A web server makes this automatic annotation workflow available to the research community.

Run time analysis of KIPEs3
The average run time of KIPEs3 per data set was assessed using a previously defined set of 121 plant species [47].All these sequence files were retrieved from Phytozome [48].The analysis was performed on coding sequences to represent transcriptome assemblies.Performance of KIPEs3 with FastTree v2.1.10 [45] was analyzed using default parameters.Run time and the number of sequences in the analyzed data sets were documented.Run times were measured for analyses with the construction of phylogenetic trees for all final candidate genes.
adherence to PLOS ONE policies on sharing data and materials.

Major technical updates
The initial Python2 implementation of KIPEs [12] was converted into a Python3 implementation.Syntax changes and the replacement of modules were an integral part of this transformation process.This technical upgrade ensures that more users are able to run KIPEs without installing an outdated Python version.It is also important for long term preservation of KIPEs and to avoid security issues.While upgrading the code to the latest Python version, we also added new features.
The established KIPEs components form the basis of KIPEs3 starting with the identification of orthologs for all bait sequences in the new species of interest (Fig 1).Each step in the pathway (enzyme/gene) is represented by one FASTA file containing numerous bait sequences Table 1.Flavonoid biosynthesis steps in the KIPEs3 repertoire.Steps added during the development of KIPEs3 are indicated with an asterisk next to the label.Steps of the core flavonoid aglycone biosynthesis were part of the initial KIPEs release [12].Required bait sequences and related information are available via GitHub [55].
A new option allows users to provide expression data that are subjected to an all-vs-all coexpression analysis among the identified candidates.It is generally expected that genes associated with following or preceding steps of the same branch in a biosynthesis pathway show a similar expression pattern, i.e. co-expression with other genes in this pathway branch.A strong co-expression value can further boost the reliability of identified candidates.KIPEs3 will produce a file that can serve as Cytoscape [74] input to visualize the gene expression network (Fig 2).This co-expression feature can also help to identify the most important candidate if multiple close paralogs exist for a given function.Many KIPEs users expressed an interest in phylogenetic trees that place the identified candidates in the context of previously characterized homologous sequences.These trees can help to explore the relationships of multiple candidates to previously characterized bait or reference sequences.KIPEs3 can automatically construct trees of the identified candidates and the corresponding bait sequences for all steps in the pathway.This enables a manual inspection without the need for additional phylogenetic analyses.KIPEs3 generates files in the Newick format which can be visualized and modified for publication with established tools like FigTree or iTOL [75].
KIPEs is able to start with coding sequences or with peptide sequences.It is even possible to provide a genome sequence, but performing the gene prediction with dedicated tools is recommended for most applications.Generally, it is best to provide peptide sequences as this speeds up the analysis.Users define the type of input sequences that they are providing.Since this information might not always be accurate, an option was added to check if the provided sequences comprise amino acids or nucleotides.A warning is shown if the result of this analysis contradicts the sequence type set by the user.However, it remains unfeasible to distinguish a collection of coding sequences from a highly fragmented genome sequence assembly based on short reads.Therefore, it remains the users' responsibility to set the correct sequence type.
Given the power of modern hardware, the computational resources required for running KIPEs are neglectable.A single core and less than 2GB of memory are already sufficient for the analysis.The entire analysis is completed within minutes, but the precise run time depends on the specific settings and size of the analyzed data set (Fig 3).Especially large data sets can cause run times of about one hour, but most plant species can be analyzed in less than 20 minutes (Fig 3A).This run time analysis was performed on Phytozome data sets of a wide range of plant species to simulate a realistic application of KIPEs3.

Comparison of KIPEs3 to other databases
MetaCyc, Plant Metabolic Network (PMN), and Plant Reactome are comprehensive databases describing the pathways, enzymes, and substrate compounds.While MetaCyc contains information about all domains of life, PMN and Plant Reactome are specific to plants and green algae.KIPEs3 is a specialized tool which provides individual gene level annotation for a user defined biosynthesis pathway in any species.MetaCyc contains only genes with experimentally determined functions in the represented pathways.PMN comprises 126 species-specific datasets and houses PlantCyc which covers over 500 additional plant species.PlantCyc harbors experimentally supported and predicted gene functions.The predicted enzymes are classified using the Ensemble Enzyme Prediction Pipeline (E2P2) software and RPSD v4.2 [76].E2P2 uses BLAST [77] and PRIAM [78] to assign functions using sequence similarity to previouslyknown genes sourced from MetaCyc, BRENDA [79], and SwissProt [80] among others.The genome sequences used for this annotation are selected on the basis of at least 75% BUSCO (Benchmarking Universal Single-Copy Orthologs) completeness [34].Plant Reactome uses Oryza sativa as a reference species to predict enzymes in 120 other species currently available in the database.Enzyme prediction in a species is based on orthologs of that enzyme in rice.Enzyme prediction in KIPEs3 is more stringent as it starts with ortholog identification based on multiple experimentally identified genes defined in literature (bait sequences).Ortholog identification is followed by an in-depth sequence similarity analysis including a check of the presence of functionally relevant amino acids.Only those candidates which pass all these steps are reported as most likely functional.Moreover, KIPEs3 can also identify similar candidates that do not fulfill all selection criteria.Contrary to most databases, researchers can use their own assembled dataset when running KIPEs3.This option enables the functional annotation of new genome sequences to identify a particular pathway of interest, given some prior knowledge about the pathway.Some features of these databases compared to KIPEs are summarized in Table 2.

Functional annotation performance of KIPEs3 compared to other databases
We compared the annotation results for FLS and LYC-b in terms of correctness of the annotation.Functional annotation correctness of a candidate can be assessed (a) in terms of correct amino acid residues required for a functional enzyme, (b) in terms of its position in a phylogenetic tree with characterized orthologous sequences, and (c) in terms of expression pattern.A higher gene expression is usually suggesting functional relevance of a given candidate.Coexpression with other members of the same biosynthesis pathways is also helpful in identifying the most promising candidates.We followed a combined approach for LYC-b and FLS annotation.The annotation correctness was assessed by the co-expression of a given candidate with other genes of the flavonoid or carotenoid biosynthesis, respectively.The presence of functionally relevant amino acid residues served as the final and deciding factor for declaring a candidate gene as "functional".Lycopene β cyclase (LYC-b) is one of the key enzymes in the carotenoid biosynthesis pathway and catalyzes the β-cyclization of both ends of lycopene to produce β-lycopene.LYC-b is known to have conserved domains such as "dinucleotide binding motif", "Cyclase motif I and II", "Conserved region β-LYC", and a "β cyclase motif" [81].Since studies about the amino acid residues responsible for catalytic sites are still lacking, a phylogenetic tree of LYC-b candidates was constructed to identify likely functional orthologs.The carotenoid genes, neoxanthin synthase (NSY) and lycopene ε-cyclase (LYC-ε), were taken as an outgroup due to their high structure similarity to LYC-b [82] (S2 and S3 Files).
We used two metrics (specificity and accuracy) to characterize the annotation performance of the different tools and databases.Specificity describes how well the database/tool can annotate a functional enzyme.It is calculated as the ratio of the number of correct candidates predicted by the database/tool to the total number of functional candidates present in that species.Accuracy denotes the ability of a database/tool to discriminate between functional and nonfunctional candidates.It is calculated as the ratio of correct predictions to the total predictions made by a database/tool for a given species and a gene in a pathway.Finally, the annotation performance is calculated as the average sum of specificity and accuracy.For LYC-b annotation, we found that the performance of Plant Reactome and KIPEs was 100% in all three species, Arabidopsis, tomato, and carrot.MetaCyc showed 100% annotation performance for Arabidopsis, 75% for tomato, and dropped to 0% for carrot as it did not annotate any LYC-b gene in carrot.Despite the high specificity (100%) of Plant Metabolic Network, the performance was between 65-75% for all three species due to low accuracy.(Fig 4(A)-4(F)).KIPEs3 maintained its 100% performance in all three species, Arabidopsis, chickpea, and orange, used for FLS annotation as well.Like KIPEs3, MetaCyc provides a correct Arabidopsis FLS annotation.However, MetaCyc does not report any FLS in chickpea and orange, while KIPEs3 identified candidates in both species.Plant Reactome was neither specific nor accurate in FLS annotation and failed to annotate any candidate in any of the above species.Plant Metabolic Network showed low performance (~0-60%) again due to low accuracy (Fig 4(G)-4(I)).
Plant Metabolic Network exhibited the least accuracy which is likely due to the function prediction approach using sequence similarity alone which can lead to inaccurate prediction.Highly similar enzymes belonging to the same family might have similar sequences, but may have different functions in vivo.The performance of Plant Reactome was also low in case of FLS annotation emphasizing the downfall of using only rice in curation and prediction, as genes in species distant from rice will be predicted less reliably and genes missing in rice cannot be predicted in other species.MetaCyc contains only experimentally evidenced genes, hence has high accuracy but low specificity.This also leads to annotation of a limited number of genes and species.
In summary, KIPEs3 showed the best performance for functional annotation of genes having both high accuracy as well as specificity.This makes KIPEs3 a good tool for annotating novel genome sequences with focus on a particular pathway with a substantial amount of prior knowledge about the involved genes in other species.(d-f) for the three species, respectively.Functional gene IDs are written in boldface.Comparison of FLS annotation (g-l) between KIPEs and three databases.The number of annotated (red dashed line) and correctly annotated (black solid line) FLS genes is displayed per tool/database for Arabidopsis thaliana (g), Cicer arietinum (h), and Citrus sinensis (i).The corresponding gene expression (TPM) of all candidate genes was investigated to differentiate between genes and potential pseudogenes (j-l) for the three species, respectively.Functional gene IDs are written in boldface.https://doi.org/10.1371/journal.pone.0294342.g004

KIPEs3 is available as an online service
KIPEs3 is provided as an online service hosted on a webserver at TU Braunschweig [85].Users can upload their FASTA files containing coding sequences or peptide sequences of a species.Bait sequence sets are stored on the server and can be directly used without the need to be manually uploaded by the user.Optional parameters are pre-configured to allow an automatic analysis with default settings, but can also be modified by the user through the web interface if necessary.Users may opt-in to receive email notifications once their analysis is finished and the results are available.Alternatively, users may choose to store a cookie to allow their browser to keep track of the progress.The archive provided for download includes all result files that users would get when running KIPEs3 locally.
The web server is implemented in Python3 using the Django web framework v4 [86].Whenever a user uploads a file and submits a new job through the form on the website, the form data is sent to the server via an HTTP POST request and the file is temporarily stored on the server machine.The job is placed into a queue and executed as soon as there are enough free resources (disk space, memory, and CPU cores) available.KIPEs3 and all its dependencies are installed on the server machine.The user is redirected to a page with a console showing the command that is executed, and the program output is displayed in real time using XMLHttpRequests.All user supplied data are deleted from the server 48 hours after the job has been finished and the results have been downloaded.

Analysis of the flavonoid biosynthesis in A. thaliana Nd-1, Capsella rubella, and Dioscorea dumetorum
First, a technical check was performed by analyzing the flavonoid biosynthesis in the A. thaliana accession Niederzenz-1 (Nd-1).It is expected that orthologs of all characterized genes of the reference accession Columbia-0 (Col-0) are also present in Nd-1, while the sequences are slightly different from the bait sequences.All genes known in Col-0 were also discovered in Nd-1 (S2 File).Next, Capsella grandiflora was analyzed with KIPEs3 to show that the identification of flavonoid biosynthesis genes works also in species that are not represented in most of the bait sequence sets like A. thaliana.Genes present in A. thaliana were also discovered in the close relative C. grandiflora (S3 File).To demonstrate the potential of KIPEs3 to annotate the flavonoid biosynthesis in a distant relative of A. thaliana, the flavonoid biosynthesis of Dioscorea dumetorum, which belongs to the monocots, was analyzed (S5 File).
The new flavonoid biosynthesis pathway steps added to the repertoire of KIPEs3 include the diverse and promiscuous decoration enzymes like glycosyltransferases, acyltransferases, and methyltransferases.Since some of these steps might be species-specific, it is possible that some steps are missing in a given plant species of interest.This requires specific attention when interpreting the KIPEs3 results to avoid false positives.While the aglycon biosynthesis is highly conserved and catalyzed by well characterized enzymes, flavonoid decorating enzymes are less studied and show a high level of promiscuity [87][88][89].This poses a challenge for the identification through sequence similarity.Little is known about the functionally relevant amino acids in the flavonoid decorating enzymes.In addition, this extension of the KIPEs3 flavonoid biosynthesis bait sequence sets is no longer restricted to enzymes, but also includes flavonoid transporters.They do not have an active center that could be composed of functionally relevant amino acids.This turns the reliable identification of transport-associated proteins into a remaining challenge.

Analysis of the carotenoid biosynthesis
A novel set of bait sequences was compiled to enable the annotation of the carotenoid biosynthesis with KIPEs3 (Table 3).The carotenoid biosynthesis is well studied in many plant species including A. thaliana [90].The set of compiled bait and reference sequences was evaluated by running the annotation workflow on A. thaliana Nd-1, Daucus carota, C. rubella, and D. dumetorum (Fig 5; S3 File).Daucus carota carotenoid biosynthesis genes previously described in the literature [91] were recovered by KIPEs3 (S6 File).This supports the accuracy of the candidate gene identification.

Transcriptional regulation of biosynthesis pathways
Many plant biosynthesis pathways are regulated at the transcriptional level by members of large transcription factor families.Examples are the R2R3-MYBs and bHLHs that form complexes involved in the control of the flavonoid biosynthesis [106].As we demonstrated previously, it is possible to analyze large gene families like MYBs and bHLHs with KIPEs [12].However, KIPEs3 is optimized for the analysis of pathways that comprise genes with low copy numbers.Most structural genes in these pathways are part of a small gene family.Large gene families like the P450 cytochromes (CYPs) or 2-oxoglutarate dioxygenases (2-ODDs) comprise multiple smaller subfamilies with distinct biological functions.Only the individual subfamilies are relevant for the analysis with KIPEs3.Ideally, the application of KIPEs3 should be restricted to the annotation of structural genes in a biosynthesis.While it is possible to investigate large transcription factor families, dedicated tools will perform better.Members of the MYB and bHLH gene families can be annotated by specific tools [47].General phylogenybased approaches can be deployed to study other involved transcription factor families [107,108].

Conclusions
While the generation of high-quality genome sequences is turning into a routine task, the structural and especially the functional annotation of these sequences is an increasing challenge.The ability to annotate biosynthesis pathways beyond the flavonoid biosynthesis positions KIPEs3 as a versatile platform for functional annotation in projects focused on specialized metabolites.We demonstrated the capabilities of KIPEs3 by comparing the annotation of FLS and LYC-b in three species against the results of other popular tools and databases.We showed that the use of broad taxonomic range bait sequences and functionally relevant amino acids allows KIPEs3 to recover more correct annotations per pathway than other tools that rely on general sequence similarity.KIPEs3 can easily annotate novel genome sequences with focus on a particular pathway while this is not possible with other annotation tools.Although we focused on plant metabolism in this study, KIPEs3 could also be applied for the investigation of non-plant organisms like fungi or bacteria.A sufficient amount of prior knowledge about the pathway in other species is the only requirement.The automatic annotation of biosynthesis pathways paves the way for comparative genomic studies that are looking for species-specific differences or intraspecific variation.

Fig 1 .
Fig 1. Simplified illustration of central KIPEs3 steps and functions that are included in the standalone and in the online version.Transcriptome or peptide sequences are supplied by the user.Initial candidates are identified by BLAST and checked through global alignments, phylogenetic trees, and the inspection of functionally relevant amino acid residues.Phylogenetic trees are constructed for all final candidates.The workflow illustration is an extended version of the one published with the initial KIPEs version[12].https://doi.org/10.1371/journal.pone.0294342.g001

Fig 3 .
Fig 3. KIPEs3 run time analysis based on 121 Phytozome data sets (A) revealed that the run time is depending on the number of sequences in the input data set (B). https://doi.org/10.1371/journal.pone.0294342.g003

Fig 4 .
Fig 4. Functional annotation tool comparison using the LYC-b gene from the carotenoid biosynthesis pathway and the FLS gene from the flavonoid biosynthesis pathway.Comparison of LYC-b annotation (a-f) between KIPEs and three databases.The number of annotated (red dashed line) and correctly annotated (black solid line) LYC-b genes is displayed per tool/database for Arabidopsis thaliana (a), Solanum lycopersicum (b), and Daucus carota (c).The corresponding gene expression (TPM) of all candidate genes was investigated to differentiate between genes and potential pseudogenes(d-f) for the three species, respectively.Functional gene IDs are written in boldface.Comparison of FLS annotation (g-l) between KIPEs and three databases.The number of annotated (red dashed line) and correctly annotated (black solid line) FLS genes is displayed per tool/database for Arabidopsis thaliana (g), Cicer arietinum (h), and Citrus sinensis (i).The corresponding gene expression (TPM) of all candidate genes was investigated to differentiate between genes and potential pseudogenes (j-l) for the three species, respectively.Functional gene IDs are written in boldface.
File.Accession numbers of the analyzed RNA-seq datasets.(TXT) S2 File.Multiple sequence alignment of the LYC-b candidates in three species (Arabidopsis thaliana, Solanum lycopersicum, and Daucus carota) annotated by KIPEs3, MetaCyc, Plant Reactome, and Plant Metabolic Network.The alignment was generated using MAFFTv7.Characteristic regions of plant LCY-bs are indicated above the sequences: Dinucleotide